We can take a first pass at annotating the clusters from our 10x data by comparing them to the reference dataset that we put together. This was done by compiling celltype-averaged expression values from the following single cell RNA-seq studies: (Pavličev et al. 2017; Vento-Tormo et al. 2018; Suryawanshi et al. 2018).
Data
Samples
This is the sample information sent by Alice. Not sure about the INP id for AL09 and AL10. The library for INP188 villi (associated with AL07) was problematic, so is not used.
sample.info <- read.delim("../info/from-alice/sample_info.tsv", stringsAsFactors = FALSE)
sample.info %>% kable(caption = "Sample information") %>% kable_styling(full_width = FALSE)
Sample information
| sample_id |
inp_id |
covid |
tissue |
| AL01 |
INP88 |
covid |
decidua |
| AL02 |
INP88 |
covid |
villi |
| AL03 |
INP90 |
covid |
decidua |
| AL04 |
INP90 |
covid |
villi |
| AL05 |
INP187 |
cntrl |
decidua |
| AL06 |
INP187 |
cntrl |
villi |
| AL07 |
INP188 |
cntrl |
decidua |
| AL08 |
INP188 |
cntrl |
villi |
| AL09 |
|
cntrl |
decidua |
| AL10 |
|
cntrl |
decidua |
Seurat-processed data
Read the Seurat-processed (by Eric) data sent by Alice.
seur <- readRDS("../info/from-alice/placenta.rds")
Error in deparse(attr(val, "_rs_call")) :
no slot of name "images" for this object of class "Seurat"
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
head(seur@meta.data)
The orig.ident column in the metadata represents the sample IDs from the sample info table above. We just have rename them to left-pad the numbers for consistency, and set new cluster names as default idents.
# left pad orig.idents
seur@meta.data$orig.ident <- gsub("(AL)(\\d)$", "\\10\\2", seur@meta.data$orig.ident)
# add additional metadata (covid status, tissue)
seur@meta.data$covid <- sample.info$covid[match(x = seur@meta.data$orig.ident,
table = sample.info$sample_id)]
seur@meta.data$tissue <- sample.info$tissue[match(x = seur@meta.data$orig.ident,
table = sample.info$sample_id)]
# rename clusters for consistency
seur$seurat_clusters <- paste0("clust_", seur$seurat_clusters)
seur$seurat_clusters <- gsub("(clust_)(\\d$)", "\\10\\2", as.character(seur$seurat_clusters)) # left pad
seur$seurat_clusters <- factor(seur$seurat_clusters, levels = paste0("clust_", c("00", "01", "02", "03", "04", "05", "06", "07", "08", "09", 10:34)))
# set seurat clusters as default idents
Idents(seur) <- seur@meta.data$seurat_clusters
Average by cluster
We have to get transcriptomes averaged by clusters, so that we can compare them with the reference data in order to narrow down annotations. For averaging, we will use sctransform normalized data since this is what we use for marker gene detection down the line.
# average sctransformed expression
seur.avg <- AverageExpression(seur)
# averaged data to use for correlations (SCT transformed)
plac <- seur.avg$SCT
Reference data
# read reference datasets
ref <- read.csv("../results/01_reference-atlas/vento-surya-pavli_joined.csv", stringsAsFactors = FALSE)
Get data in shape
We will join the data with the reference data so they are both in the same dataframe.
# add gene names column
plac$gene_name <- rownames(plac)
# inner join
plac <- dplyr::inner_join(plac, ref, by = "gene_name")
head(plac)
Correlations within the data
Before moving to annotating clusters by comparison with the reference dataset, first we will look at the correlation structure within the dataset.
Correlation matrix
# build correlation matrix from expression data
cor.matrix <- cor(plac %>% dplyr::select(starts_with("clust")), method = 'spearman')
# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix))
hc <- hclust(dd, method = 'complete')
cormat <- cor.matrix[hc$order, hc$order]
# melt correlation matrix
dat <- reshape2::melt(cormat, na.rm = T)
## plot
p <- ggplot(data = dat, aes(Var1, Var2, fill = value)) +
geom_tile(colour = "white") +
scale_fill_gradient(low = 'white', high = 'red',
name = "Spearman\nCorrelation") +
coord_fixed(ratio = 1) +
labs(title = "Correlations among clusters") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size = 8,
hjust=1, vjust = 0.5),
axis.text.y = element_text(size=8),
axis.ticks.length = unit(0.15, units = c('lines')),
legend.title = element_text(size = 10),
axis.title = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.key.size = unit(0.8, units = c('lines')))
ggsave(p, filename = "../results/02_annotation/plots/corrplot_clusters.pdf",
device = "pdf", width = 7, height = 6, units = "in")
print(p)

Heirarchical clustering
# build correlation matrix from expression data
cor.matrix <- cor(plac %>% dplyr::select(starts_with("clust")), method = 'spearman')
# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix))
hc <- hclust(dd, method = 'complete')
pdf(file = "../results/02_annotation/plots/hclust_clusters.pdf", width = 7, height = 5)
plot(hc, cex = 0.8)
dev.off()
null device
1
plot(hc, cex = 0.8)

There are three big cluster blocks, with substructure within them. The cluster blocks also correspond well with the UMAP plot that was sent by Alice.
Annotation by correlation
Let’s now actually measure correlations between all clusters with the reference datasets. We will treat our data as query dataset and for each cluster assign top 3 cell types in each reference dataset.
refdata <- c("vento", "surya", "pavli")
# build correlation matrix from expression data
cor.matrix <- cor(plac[, names(plac)[names(plac) != "gene_name"]],
method = 'spearman')
# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix))
hc <- hclust(dd, method = 'complete')
cor.matrix <- cor.matrix[hc$order, hc$order]
# melt correlation matrix
cormat <- reshape2::melt(cor.matrix, na.rm = T)
cormat$Var1_source <- sapply(strsplit(as.character(cormat$Var1), split = "_"), "[[", 1)
cormat$Var2_source <- sapply(strsplit(as.character(cormat$Var2), split = "_"), "[[", 1)
# subset
cormat <- cormat[which(cormat$Var1_source %in% refdata &
cormat$Var2_source == "clust"), ]
# top 3 matches with highest correlation coefficients
topmatch <- data.frame( # empty df to fill top matches from the loop below
cluster = unique(as.character(cormat$Var2)) %>% sort(),
vento.top1 = NA,
surya.top1 = NA,
pavli.top1 = NA,
vento.top2 = NA,
surya.top2 = NA,
pavli.top2 = NA,
vento.top3 = NA,
surya.top3 = NA,
pavli.top3 = NA
)
cormat$top3 <- NA
for(i in unique(cormat$Var2)){
for(j in refdata){
# identify top3 match indices
ind <- which(cormat$Var2 == i & cormat$Var1_source == j)
val <- cormat$value[ind]
top3ind <- ind[order(val, decreasing = TRUE)[1:3]]
# assign match ranking to correlation data for plotting
cormat$top3[top3ind[1]] <- "1"
cormat$top3[top3ind[2]] <- "2"
cormat$top3[top3ind[3]] <- "3"
# assign top matches to topmatches dataframe
topmatch[topmatch$cluster == i, paste0(j, ".top1")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[1]]))
topmatch[topmatch$cluster == i, paste0(j, ".top2")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[2]]))
topmatch[topmatch$cluster == i, paste0(j, ".top3")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[3]]))
}
}
Plots
# plotting function
clustAnnoPlot <- function(dat, query, reference){
dat <- dat[which(dat$Var2_source == query & dat$Var1_source == reference), ]
p <- ggplot(data = dat,
aes(Var1, Var2, fill = value)) +
geom_tile(colour = "white") +
scale_fill_gradient(low = 'white', high = 'red',
name = "Spearman\nCorrelation") +
geom_point(aes(Var1, Var2, alpha = top3),
size = 1.5, shape = 19, stroke = 0) +
scale_alpha_manual(values = c(1, 0.5, 0.25),
breaks = c(1, 2, 3),
name = "top3", na.value = 0) +
coord_fixed(ratio = 1) +
xlab("reference") +
ylab("query") +
labs(caption = "For each celltype in query, black points represent top 3 celltypes from reference with highest correlation.",
title = paste0(query, " vs. ", reference)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size = 8,
hjust=1, vjust = 0.5),
axis.text.y = element_text(size=8),
axis.ticks.length = unit(0.15, units = c('lines')),
legend.title = element_text(size = 10),
axis.title = element_text(size = 8),
plot.caption = element_text(size = 7),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.key.size = unit(0.8, units = c('lines')))
return(p)
}
Against Vento-Tormo et al
## Annotation plots
anno.vento <- clustAnnoPlot(dat = cormat, query = "clust", reference = "vento")
cowplot::ggsave2(anno.vento, device = "pdf", width = 7.5, height = 6.5, units = "in",
filename = "../results/02_annotation/plots/annotation-by-correlation_ref-vento.pdf")
print(anno.vento)

Against Suryavanshi et al
## Annotation plots
anno.surya <- clustAnnoPlot(dat = cormat, query = "clust", reference = "surya")
cowplot::ggsave2(anno.surya, device = "pdf", width = 7.5, height = 6.5, units = "in",
filename = "../results/02_annotation/plots/annotation-by-correlation_ref-surya.pdf")
print(anno.surya)

Against Pavlicev et al
## Annotation plots
anno.pavli <- clustAnnoPlot(dat = cormat, query = "clust", reference = "pavli")
cowplot::ggsave2(anno.pavli, device = "pdf", width = 7, height = 6, units = "in",
filename = "../results/02_annotation/plots/annotation-by-correlation_ref-pavli.pdf")
print(anno.pavli)

Top matches
topmatch %>% kable(caption = "Top 3 matches against all reference datasets.") %>% kable_styling(full_width = FALSE)
Top 3 matches against all reference datasets.
| cluster |
vento.top1 |
surya.top1 |
pavli.top1 |
vento.top2 |
surya.top2 |
pavli.top2 |
vento.top3 |
surya.top3 |
pavli.top3 |
| clust_00 |
dS3 |
dec.DSC |
DEC |
dS2 |
dec.FB1 |
ESF |
dP2 |
dec.FB2 |
SYN |
| clust_01 |
dM2 |
dec.MAC |
SYN |
dM1 |
vil.HC |
DEND |
HB |
dec.DC2 |
DEC |
| clust_02 |
dP2 |
dec.FB1 |
DEC |
dS2 |
dec.FB2 |
SYN |
dS3 |
dec.DSC |
ESF |
| clust_03 |
dM1 |
dec.MAC |
SYN |
MO |
vil.HC |
DEND |
dM2 |
dec.NK1 |
DEC |
| clust_04 |
Endo.m |
dec.VEC |
DEC |
Endo.L |
dec.LEC |
ESF |
Endo.f |
vil.VEC |
SYN |
| clust_05 |
dNK3 |
dec.NK1 |
DEC |
dNK2 |
dec.TC |
ESF |
Tcells |
dec.MAC |
SYN |
| clust_06 |
fFB1 |
dec.FB1 |
DEC |
dP2 |
vil.FB3 |
SYN |
dS2 |
dec.FB2 |
ESF |
| clust_07 |
fFB1 |
vil.FB3 |
DEC |
dP2 |
dec.FB1 |
ESF |
dS2 |
dec.FB2 |
SYN |
| clust_08 |
dM1 |
dec.MAC |
DEND |
MO |
vil.HC |
SYN |
dM2 |
dec.DC2 |
DEC |
| clust_09 |
Tcells |
dec.TC |
DEC |
dNK3 |
dec.NK1 |
ESF |
dNK2 |
dec.NK2 |
SYN |
| clust_10 |
dNK3 |
dec.NK1 |
DEC |
dNK2 |
dec.TC |
ESF |
dNK1 |
dec.NK2 |
SYN |
| clust_11 |
dM1 |
dec.MAC |
DEND |
dM2 |
vil.HC |
DEC |
dM3 |
dec.DC2 |
SYN |
| clust_12 |
dM1 |
dec.MAC |
SYN |
dM2 |
vil.HC |
DEND |
dM3 |
dec.DC2 |
DEC |
| clust_13 |
EVT |
vil.EVT |
EVT |
SCT |
dec.DSC |
DEC |
dP2 |
dec.FB1 |
ESF |
| clust_14 |
dNK3 |
dec.NK1 |
DEC |
dNK2 |
dec.TC |
ESF |
dNK1 |
dec.NK2 |
SYN |
| clust_15 |
dP2 |
dec.MAC |
DEC |
dM1 |
dec.FB2 |
SYN |
fFB1 |
dec.FB1 |
ESF |
| clust_16 |
VCT |
vil.VCT |
CYT2 |
SCT |
vil.SCT |
CYT3 |
EVT |
vil.EVT |
CYT1 |
| clust_17 |
VCT |
vil.VCT |
CYT2 |
SCT |
vil.SCT |
CYT3 |
EVT |
vil.EVT |
CYT1 |
| clust_18 |
dS3 |
dec.DSC |
DEC |
dS2 |
dec.FB1 |
ESF |
dS1 |
dec.FB2 |
CYT2 |
| clust_19 |
dP1 |
dec.SMC |
DEC |
dP2 |
vil.FB2 |
ESF |
Endo.m |
dec.FB2 |
SYN |
| clust_20 |
Plasma |
dec.MAC |
SYN |
dM1 |
dec.NK1 |
DEC |
dNK3 |
dec.TC |
DEND |
| clust_21 |
Endo.m |
dec.VEC |
DEC |
Endo.f |
vil.VEC |
SYN |
Endo.L |
dec.LEC |
ESF |
| clust_22 |
dM2 |
dec.MAC |
SYN |
dM1 |
vil.HC |
DEC |
HB |
dec.DC2 |
DEND |
| clust_23 |
SCT |
vil.VCT |
CYT2 |
VCT |
vil.SCT |
CYT3 |
EVT |
vil.EVT |
CYT1 |
| clust_24 |
SCT |
vil.SCT |
CYT2 |
VCT |
vil.VCT |
CYT3 |
EVT |
vil.EVT |
SYN |
| clust_25 |
VCT |
vil.VCT |
CYT2 |
SCT |
vil.SCT |
CYT3 |
EVT |
vil.EVT |
CYT1 |
| clust_26 |
dM1 |
dec.MAC |
DEC |
dM2 |
vil.HC |
SYN |
dM3 |
dec.VEC |
ESF |
| clust_27 |
dM2 |
dec.MAC |
SYN |
fFB1 |
vil.HC |
DEC |
HB |
vil.FB3 |
ESF |
| clust_28 |
MO |
dec.MAC |
SYN |
dM1 |
vil.HC |
CYT2 |
dM2 |
dec.NK2 |
DEND |
| clust_29 |
SCT |
vil.SCT |
CYT2 |
VCT |
vil.VCT |
CYT1 |
EVT |
vil.EVT |
CYT3 |
| clust_30 |
dNK.p |
dec.NK2 |
ESF |
dNK1 |
dec.NK1 |
DEND |
dNK3 |
dec.MAC |
CYT2 |
| clust_31 |
dM2 |
dec.MAC |
SYN |
dM1 |
vil.HC |
DEND |
MO |
dec.NK1 |
CYT2 |
| clust_32 |
Tcells |
dec.TC |
SYN |
dNK3 |
dec.NK1 |
DEC |
dM1 |
dec.MAC |
DEND |
| clust_33 |
fFB1 |
vil.FB3 |
SYN |
dP2 |
dec.SMC |
DEC |
dP1 |
dec.FB1 |
ESF |
| clust_34 |
dM2 |
dec.MAC |
SYN |
dM1 |
vil.HC |
DEND |
HB |
dec.DC2 |
DEC |
NA
Labels for clusters
We need to create consistent labels for all cell types that we identify. Below is a list I created in which the top level objects are labels we will give to the cell types, within which contained are lists of corresponding labels from vento and surya. These are tentative labels. After the first pass, when we go through the clusters with a fine-toothed comb, we can modify them further. For example, if we have multiple clusters labeled as dec.Mac (decidual macrophages), and if those clusters are distinct, we can then break up this label into dec.Mac1 and dec.Mac2.
labs <- list("dec.DSC" = list("vento.lab" = c("dS1", "dS2", "dS3"),
"surya.lab" = c("dec.DSC", "dec.FB1", "dec.FB2")),
"dec.DC" = list("vento.lab" = c("DC1", "DC2"),
"surya.lab" = c("dec.DC1", "dec.DC2")),
"dec.Mac" = list("vento.lab" = c("dM1", "dM2", "dM3"),
"surya.lab" = c("dec.MAC")),
"dec.NK" = list("vento.lab" = c("dNK.p", "dNK1", "dNK2", "dNK3", "NK.CD16neg", "NK.CD16pos"),
"surya.lab" = c("dec.NK1", "dec.NK2")),
"dec.SMC" = list("vento.lab" = c("dP1", "dP2"),
"surya.lab" = c("dec.SMC")),
"dec.Endo" = list("vento.lab" = c("Endo.m"),
"surya.lab" = c("dec.VEC")),
"dec.Epi" = list("vento.lab" = c("Epi1", "Epi2"),
"surya.lab" = c("dec.EEC")),
"dec.Tcell" = list("vento.lab" = c("Tcells"),
"surya.lab" = c("dec.TC")),
"vil.Endo" = list("vento.lab" = c("Endo.f"),
"surya.lab" = c("vil.VEC")),
"vil.EVT" = list("vento.lab" = c("EVT"),
"surya.lab" = c("vil.EVT")),
"vil.FB" = list("vento.lab" = c("fFB1", "fFB2"),
"surya.lab" = c("vil.FB1", "vil.FB2", "vil.FB3")),
"vil.Hofb" = list("vento.lab" = c("HB"),
"surya.lab" = c("vil.HC")),
"vil.SCT" = list("vento.lab" = c("SCT"),
"surya.lab" = c("vil.SCT")),
"vil.VCT" = list("vento.lab" = c("VCT"),
"surya.lab" = c("vil.VCT")),
"unk.Gran" = list("vento.lab" = c("Granulocytes"),
"surya.lab" = c()),
"unk.ILC" = list("vento.lab" = c("ILC3"),
"surya.lab" = c()),
"unk.Mono" = list("vento.lab" = c("MO"),
"surya.lab" = c()),
"unk.Plasma" = list("vento.lab" = c("Plasma"),
"surya.lab" = c()),
"unk.EB" = list("vento.lab" = c(),
"surya.lab" = c("vil.EB")),
"unk.Endo.L" = list("vento.lab" = c("Endo.L"),
"surya.lab" = c("dec.LEC"))
)
Matches consistent between reference datasets
The following clusters have consistent top1 match between vento and surya references, i.e. they correspond to the same cell type category. The pavli reference is too unresolved to be used diagnostically, so we will ignore it for now. For some cell types it is confirmatory, though, e.g. cluster_00 corresponds to DSC in all three references.
# find out which clusters are consistent.
# If vento.top1 and surya.top1 are found in the same item in the labs list, the mapping is consistent.
consistent <- c(NA)
for(i in 1:nrow(topmatch)){
consistent[i] <- grep(topmatch$vento.top1[i], labs) == grep(topmatch$surya.top1[i], labs)
}
# subset to include consistent set
topmatch.cons <- topmatch %>%
filter(consistent) %>%
dplyr::select(cluster, vento.top1, surya.top1)
# add our tentative labels to the clusters
for(i in 1:nrow(topmatch.cons)){
topmatch.cons$label[i] <- names(labs)[grep(topmatch.cons$vento.top1[i], labs)]
}
# print
topmatch.cons[order(topmatch.cons$label), ] %>% knitr::kable() %>% kable_styling(full_width = FALSE)
| |
cluster |
vento.top1 |
surya.top1 |
label |
| 1 |
clust_00 |
dS3 |
dec.DSC |
dec.DSC |
| 16 |
clust_18 |
dS3 |
dec.DSC |
dec.DSC |
| 4 |
clust_04 |
Endo.m |
dec.VEC |
dec.Endo |
| 18 |
clust_21 |
Endo.m |
dec.VEC |
dec.Endo |
| 2 |
clust_01 |
dM2 |
dec.MAC |
dec.Mac |
| 3 |
clust_03 |
dM1 |
dec.MAC |
dec.Mac |
| 7 |
clust_08 |
dM1 |
dec.MAC |
dec.Mac |
| 10 |
clust_11 |
dM1 |
dec.MAC |
dec.Mac |
| 11 |
clust_12 |
dM1 |
dec.MAC |
dec.Mac |
| 19 |
clust_22 |
dM2 |
dec.MAC |
dec.Mac |
| 22 |
clust_26 |
dM1 |
dec.MAC |
dec.Mac |
| 23 |
clust_27 |
dM2 |
dec.MAC |
dec.Mac |
| 26 |
clust_31 |
dM2 |
dec.MAC |
dec.Mac |
| 29 |
clust_34 |
dM2 |
dec.MAC |
dec.Mac |
| 5 |
clust_05 |
dNK3 |
dec.NK1 |
dec.NK |
| 9 |
clust_10 |
dNK3 |
dec.NK1 |
dec.NK |
| 13 |
clust_14 |
dNK3 |
dec.NK1 |
dec.NK |
| 25 |
clust_30 |
dNK.p |
dec.NK2 |
dec.NK |
| 17 |
clust_19 |
dP1 |
dec.SMC |
dec.SMC |
| 8 |
clust_09 |
Tcells |
dec.TC |
dec.Tcell |
| 27 |
clust_32 |
Tcells |
dec.TC |
dec.Tcell |
| 12 |
clust_13 |
EVT |
vil.EVT |
vil.EVT |
| 6 |
clust_07 |
fFB1 |
vil.FB3 |
vil.FB |
| 28 |
clust_33 |
fFB1 |
vil.FB3 |
vil.FB |
| 20 |
clust_24 |
SCT |
vil.SCT |
vil.SCT |
| 24 |
clust_29 |
SCT |
vil.SCT |
vil.SCT |
| 14 |
clust_16 |
VCT |
vil.VCT |
vil.VCT |
| 15 |
clust_17 |
VCT |
vil.VCT |
vil.VCT |
| 21 |
clust_25 |
VCT |
vil.VCT |
vil.VCT |
Ambiguous clusters
The top1 matches for some clusters with vento and surya are not the same. The are the clusters that will require a more detailed look to determine their identify.
# subset for ambiguous clusters
topmatch.ambig <- topmatch %>%
filter(!consistent) %>%
dplyr::select(cluster, vento.top1, surya.top1)
# add our labels.
for(i in 1:nrow(topmatch.ambig)){
topmatch.ambig$label[i] <- paste0(
names(labs)[grep(topmatch.ambig$vento.top1[i], labs)],
" or ",
names(labs)[grep(topmatch.ambig$surya.top1[i], labs)]
)
}
# print
topmatch.ambig[order(topmatch.ambig$label), ] %>% kable() %>% kable_styling(full_width = FALSE)
| |
cluster |
vento.top1 |
surya.top1 |
label |
| 1 |
clust_02 |
dP2 |
dec.FB1 |
dec.SMC or dec.DSC |
| 3 |
clust_15 |
dP2 |
dec.MAC |
dec.SMC or dec.Mac |
| 6 |
clust_28 |
MO |
dec.MAC |
unk.Mono or dec.Mac |
| 4 |
clust_20 |
Plasma |
dec.MAC |
unk.Plasma or dec.Mac |
| 2 |
clust_06 |
fFB1 |
dec.FB1 |
vil.FB or dec.DSC |
| 5 |
clust_23 |
SCT |
vil.VCT |
vil.SCT or vil.VCT |
Annotation refinement
The annotations we have so far are only a starting point. Now we can look at each annotation more carefully to make sure that everything makes sense.
Sample contribution to clusters
One of the ways in which we can refine the clusters further is by knowing which tissue that sample arises from. For example, if a cluster has macrophage gene signature and if most cells in that cluster come from villi samples, we can further narrow down the identity of the cluster to Hofbauer cells. We can do this by simply counting for each cluster how many cells come from decidua vs villi and by calculating the percentage.
# count cells in each cluster by tissue of origin
bytissue <- table(seur$tissue, seur$seurat_clusters) %>%
as.data.frame() %>%
dplyr::rename(tissue = Var1, cluster = Var2, frequency = Freq)
# calculate fraction
for(i in 1:nrow(bytissue)){
bytissue$fraction[i] <- bytissue$frequency[i]/
sum(bytissue$frequency[bytissue$cluster == bytissue$cluster[i]])
}
# plot
p.cells <- ggplot(data = bytissue, aes(x = frequency, y = cluster)) +
geom_bar(aes(fill = tissue), stat = "identity", position = "stack") +
ggtitle("Number of cells")
p.frac <- ggplot(data = bytissue, aes(x = fraction, y = cluster)) +
geom_bar(aes(fill = tissue), stat = "identity", position = "stack") +
ggtitle("Fraction of cells")
p.cells + p.frac + plot_layout(guides = "collect", nrow = 2) +
plot_annotation(caption = "Sample contribution to clusters") &
coord_flip() &
scale_fill_tableau(palette = "Classic 10 Medium") &
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(size = 0.25),
plot.title = element_text(size = 10),
axis.text.x = element_text(angle = 90, vjust = 0.5))

NA
NA
Following clusters have more than 75% cells coming from the same tissue of origin.
bytissue[, c("cluster", "tissue", "fraction")] %>%
pivot_wider(names_from = "tissue", values_from = "fraction") %>%
filter(decidua > 0.75 | villi > 0.75) %>%
kable(digits = 2, row.names = FALSE,
caption = "Clusters with more than 75% cells from one tissue") %>%
kable_styling(full_width = FALSE)
Clusters with more than 75% cells from one tissue
| cluster |
decidua |
villi |
| clust_00 |
0.98 |
0.02 |
| clust_01 |
0.16 |
0.84 |
| clust_04 |
0.87 |
0.13 |
| clust_05 |
0.94 |
0.06 |
| clust_06 |
0.24 |
0.76 |
| clust_12 |
0.82 |
0.18 |
| clust_13 |
0.93 |
0.07 |
| clust_15 |
0.78 |
0.22 |
| clust_16 |
0.87 |
0.13 |
| clust_17 |
0.99 |
0.01 |
| clust_18 |
0.92 |
0.08 |
| clust_19 |
0.77 |
0.23 |
| clust_21 |
0.76 |
0.24 |
| clust_22 |
0.12 |
0.88 |
| clust_24 |
0.77 |
0.23 |
| clust_25 |
0.85 |
0.15 |
| clust_26 |
0.22 |
0.78 |
| clust_27 |
0.02 |
0.98 |
| clust_29 |
0.79 |
0.21 |
| clust_31 |
0.02 |
0.98 |
| clust_33 |
0.04 |
0.96 |
| clust_34 |
0.03 |
0.97 |
Following clusters have ambiguous origin, i.e. they contain cells from decidua and villi samples.
bytissue[, c("cluster", "tissue", "fraction")] %>%
pivot_wider(names_from = "tissue", values_from = "fraction") %>%
filter(decidua < 0.75 & villi < 0.75) %>%
kable(digits = 2, row.names = FALSE,
caption = "Clusters with cells from both tissues") %>%
kable_styling(full_width = FALSE)
Clusters with cells from both tissues
| cluster |
decidua |
villi |
| clust_02 |
0.65 |
0.35 |
| clust_03 |
0.55 |
0.45 |
| clust_07 |
0.29 |
0.71 |
| clust_08 |
0.38 |
0.62 |
| clust_09 |
0.57 |
0.43 |
| clust_10 |
0.59 |
0.41 |
| clust_11 |
0.44 |
0.56 |
| clust_14 |
0.67 |
0.33 |
| clust_20 |
0.44 |
0.56 |
| clust_23 |
0.74 |
0.26 |
| clust_28 |
0.66 |
0.34 |
| clust_30 |
0.69 |
0.31 |
| clust_32 |
0.41 |
0.59 |
This doesn’t necessarily clarify too many things. There are many clusters that have over 3/4th of the cells coming from the same tissue, but often not the tissue you expect. For example clusters 16 and 17 are either villous or syncytial trophoblast cells but they are coming mostly from decidua samples. You expect that for cluster 13, which is extravillous trophoblast but not for the other types. At least in the species that I have worked with, it is often difficult to separate the fetal tissue from maternal tissue. So this mixed result could be a result of that difficulty. It looks like we can use this a rough guide, but we can’t necessarily rely of this analysis, i.e. the marker gene should take precedence over tissue of origin in cell type id because the latter is experimentally (and as a consequence, analytically) messy to disentangle.
Marker genes
Identify marker genes
I ran Seurat::findAllMarkers on this data using the SCT assay. This function outputs a list of marker genes for each cluster compared to all other cells in the data. This takes a while (hours) to run, so I ran it on the cluster and saved the results to a .csv. See the code directory for full code, and see below for the function call.
# set default assay to SCT
DefaultAssay(object = plac) <- "SCT"
# find markers for all clusters
markers <- FindAllMarkers(object = plac,
only.pos = TRUE,
logfc.threshold = 0.25)
# write.csv
write.csv(markers, "/home/arc78/scratch60/covid-placenta/markers_sct.csv", row.names = FALSE)
markers <- read.csv("../results/02_annotation/files/markers_sct.csv")
# rename clusters
markers$cluster <- paste0("clust_", markers$cluster)
markers$cluster <- gsub("(clust_)(\\d)$", "\\10\\2", markers$cluster)
markers$cluster <- factor(markers$cluster, levels = unique(markers$cluster))
# split by cluster
markers <- split.data.frame(markers, markers$cluster)
Plot marker genes
We will plot top 50 marker genes for all clusters and save the plot to pdf.
# plotting function
DotPlot2 <- function(object = "seur", assay = "SCT", features, title, ...) {
p <- DotPlot(seur,
assay = "SCT",
features = features,
dot.min = 0.05, dot.scale = 4) +
coord_flip() +
labs(caption = paste0(assay, " normalized expression"), title = title) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
panel.grid = element_blank())
return(p)
}
top50marker.plots <- list()
for(i in names(markers)) {
if(nrow(markers[[i]]) > 50)
features <- markers[[i]]$gene[1:50]
if(nrow(markers[[i]]) < 50)
features <- markers[[i]]$gene
top50marker.plots[[i]] <- DotPlot2(features = features, title = i)
cowplot::ggsave2(top50marker.plots[[i]],
filename = paste0("../results/02_annotation/plots/top50markers_dotplot_", i, ".pdf"),
width = 7.5, height = 7.5, units = "in")
}
Manual annotation
With the marker gene plots we have made, we can examine the top marker genes in each cluster to manually confirm or adjust the annotation of each cluster.
Let’s make a dataframe in which to keep track of the assignments.
annotation <- data.frame(
cluster = unique(seur$seurat_clusters) %>% sort(),
annotation = NA,
notes = NA
)
For each set of marker genes, we can also perform a number of manual checks to assess cell type annotation.
First is GO enrichment using clusterProfile package, for which below is a function for quickly looking at top 10 enriched GO categories.
# function for GO enrichment
# 1. for a given cluster, prints n.print.rows top enriched GO terms
# 2. Can exclude ribosomal genes (default) because they dominate some GO enrichments
enrichGO2 <- function(clust, ngenes = 100, include.ribo = FALSE, n.print.rows = 10, ...) {
ids.bitr <- bitr(markers[[clust]]$gene[1:ngenes],
fromType = "SYMBOL", toType = c("ENTREZID", "SYMBOL"),
OrgDb = "org.Hs.eg.db")
if(include.ribo == FALSE)
features <- ids.bitr$ENTREZID[!(grepl("RP[SL]", ids.bitr$SYMBOL))]
if(include.ribo == TRUE)
features <- ids.bitr$ENTREZID
ego <- enrichGO(
gene = features,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01, # default 0.01
qvalueCutoff = 0.05, # default 0.05
readable = TRUE)
dftoprint <- clusterProfiler::simplify(ego)@result[1:n.print.rows, c(2,8)]
rownames(dftoprint) <- NULL
return(dftoprint)
}
We can also check if the cluster is enriched is genes that are labeled as markers genes for a cell type by other studies. (Vento-Tormo et al. 2018) have provided top 30 marker genes for all cell types identified by them.
# read top 30 markers from vento-tormo et al
markers.vento <- readxl::read_excel("../info/from-papers/vento-tormo_2018/41586_2018_698_MOESM1_ESM/Supplementary Table 2.xlsx")
# rename columns
markers.vento <- dplyr::rename(markers.vento,
VCT_1 = VCT...2,
Tcells_1 = `T cells...5`,
Tcells_2 = `T cells...9`,
VCT_2 = `VCT...10`,
Tcells_3 = `T cells...11`,
FB_1 = F1,
EVT_1 = `EVT...14`,
Tcells_4 = `T cells...15`,
EVT_2 = `EVT...16`,
NKCD16pos = `Blood NK CD16+`,
MO_1 = `MO...22`,
MO_2 = `MO...27`,
NKCD16neg = `Blood NK CD16-`,
FB2 = F2,
Endo.m = `Endo (m)`,
Endo.L = `Endo L`,
Endo.f = `Endo (f)`
)
# remove ensembl ids
markers.vento <- apply(X = markers.vento, MARGIN = 2, FUN = function(x) gsub("_ENSG\\d+", "", x)) %>%
as.data.frame()
(Suryawanshi et al. 2018) have also provided a list of top30 marker genes in the supplementary data.
markers.surya.vil <- readxl::read_excel("../info/from-papers/suryawanshi_2018/supp-data/aau4788_Data_file_S3.xlsx", sheet = "Villi", range = "A1:G271", trim_ws = TRUE)
markers.surya.dec <- readxl::read_excel("../info/from-papers/suryawanshi_2018/supp-data/aau4788_Data_file_S3.xlsx", sheet = "Deciuda", range = "A1:G331", trim_ws = TRUE)
markers.surya.vil <- dplyr::rename(markers.surya.vil, cluster = `cell type`)
markers.surya.vil$cluster <- paste0("vil.", markers.surya.vil$cluster)
markers.surya.dec$`cluster` <- paste0("dec.", markers.surya.dec$`cluster`)
markers.surya <- rbind(markers.surya.vil, markers.surya.dec)
Clusters 0 and 18
Notes
- Clusters 0 and 18 highly correlated with each other and most similar to each other than to other clusters.
- Both are consistently similar to the decidual stromal cells in both vento and surya datasets.
- The marker genes include Prolactin, IGFBP1 etc, which are typical of decidual stromal cells.
- These two clusters also arise unabiguously from the decidua samples rather than villi samples.
annotation$annotation[annotation$cluster %in% c("clust_00", "clust_18")] <- "dec.DSC"
annotation[annotation$cluster %in% paste0("clust_", c("00", "18")), ]
Marker genes plots
top50marker.plots[["clust_00"]]

top50marker.plots[["clust_18"]]

Go anrichment
enrichGO2("clust_00")
'select()' returned 1:1 mapping between keys and columns
Vento markers for DSC
DotPlot2(features = markers.vento$dS3, title = "vento: dS3")

Surya markers for DSC
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.DSC"], title = "surya markers: DSC")

Clusters 4 and 21
Notes
- Clusters 4 and 21 are highly correlated.
- They consistently get assigned as decidual endothelial cells by both vento and surya datasets.
- More than 75% of the cells in both of these clusters come from the decidua samples.
- Markers genes of both clusters are enriched in GO categories related to epithelium development, endothelium development etc.
- But these two clusters are not as alike as clusters 0 and 18. They are different in terms of their expression of many endothelium related genes like CD34, PROX1, VWF, SPARCL. Cluster 21 looks more like “typical” endothelial cells. Cluster 4 has low expression of many typical endothelial genes.
- Cluster 4 has higher expression of some lymphatic endothelial cells: PROX1, LYVE1. Many genes from the list of surya marker genes for dec.LEC are expressed in cluster 4. This suggests that Cluster 4 may be lymphatic endothelial cells and cluster 21 may be endothelial cells.
annotation$annotation[annotation$cluster %in% c("clust_04")] <- "dec.Endo.L"
annotation$annotation[annotation$cluster %in% c("clust_21")] <- "dec.Endo"
annotation[annotation$cluster %in% paste0("clust_", c("04", "21")), ]
Marker gene plots
top50marker.plots[["clust_04"]]

top50marker.plots[["clust_21"]]

GO enrichment
enrichGO2("clust_04")
'select()' returned 1:1 mapping between keys and columns
2% of input gene IDs are fail to map...
enrichGO2("clust_21")
'select()' returned 1:1 mapping between keys and columns
2% of input gene IDs are fail to map...
Vento markers for endothelial cells
DotPlot2(features = markers.vento$Endo.m, title = "vento: Endo.m")

DotPlot2(features = markers.vento$Endo.f, title = "vento: Endo.f")

DotPlot2(features = markers.vento$Endo.L, title = "vento: Endo.L")

Surya markers for endothelial cells
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.VEC"], title = "surya markers: vil.VEC")

DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.VEC"], title = "surya markers: dec.VEC")

DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.LEC"], title = "surya markers: dec.LEC")

Cluster 13
Notes
- This cluster is unique in its expression of HLA-G, typical of extravillous trophoblast.
- Several other genes often enriched in EVT are also enriched in this cluster: FN1, PAPPA2, DIO2, etc.
- top 30 markers associated with EVT_2 from vento dataset are also highly and specifically expressed in this cluster.
- It is consistently assigned to be EVT based on vento, surya, and pavli datasets.
Annotation:
Cluster 13: Extravillous trophoblast (vil.EVT)
annotation$annotation[annotation$cluster %in% c("clust_13")] <- "vil.EVT"
annotation[annotation$cluster %in% c("clust_13"), ]
Marker genes plot
top50marker.plots[["clust_13"]]

GO enrichment
enrichGO2("clust_13")
1% of input gene IDs are fail to map...
Vento markers for EVT
DotPlot2(features = markers.vento$EVT_1, title = "vento markers: EVT_1")

DotPlot2(features = markers.vento$EVT_2, title = "vento markers: EVT_2")

Surya marker genes for EVT
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.EVT"], title = "surya markers: vil.EVT")

Other trophoblast clusters (16, 17, 23, 24, 25, 29)
Notes
For more information on expression patterns and markers of different trophoblast types, see (Liu et al. 2018).
- In addition to cluster 13 (EVT), there are 6 other clusters that are likely trophoblast cell types (they express KRT7 and are related clusters). These 6 clusters are organized into two correlated blocks (see “correlations section above”): (16, 17, 25) and (23, 29, 24) of which the former are likely VCT and the latter SCT according to correlation-based annotation.
- Cluster 29 appears to be SCT given its expression of CGA, GDF15, ERVFRD-1 (Syncytin 2) genes. For some reason GO enrichment didn’t work for markers of cluster 29, but it does consistently express SCT markers from vento and surya.
- Cluster 24 doesn’t express ERVFRD-1 much, but it does express most other telltale genes of SCT: CSH1, CSH2, CGA, CSHL1, many PSG genes, GH2, PGF, GDF15, etc. And it is not enriched for expression of VCT markers from vento and surya. It’s enriched GO terms have to do with hormone production and signaling as expected for SCT. Thus, this is most likely an SCT cluster.
- Clusters 16 and 17 don’t express many markers of SCT, but do express many vento and surya markers of VCT: PAGE4, PEG10, XAGE3, etc.
- Cluster 25 has VCT markers as well as a strong signature of cell division genes. This shows up in the GO enrichment as well, where most enriched GO terms are “nuclear division”, “chromosome segregation”, “cell cycle G1/S phase transition” etc. This is likely a cluster of proliferating VCT cells. Incidentally both (Vento-Tormo et al. 2018) and (Liu et al. 2018) have a separate population of VCT with a cell cycle gene signature (VCT_2 in vento).
- Cluster 23 is most similar to SCT of vento but VCT of surya. Indeed this cluster has expression of marker genes for both VCT and SCT (see plots). It expresses PAGE4, PEG10 etc, but also expresses syncytin gene ERVW-1. This could represent a group of cytotrophoblast cells that are in the process of differentiating to sycnytial trophoblast. We will label it as a VCT type.
annotation$annotation[annotation$cluster %in% c("clust_16")] <- "vil.VCT_1"
annotation$annotation[annotation$cluster %in% c("clust_17")] <- "vil.VCT_2"
annotation$annotation[annotation$cluster %in% c("clust_23")] <- "vil.VCT_3"
annotation$annotation[annotation$cluster %in% c("clust_25")] <- "vil.VCT_4"
annotation$annotation[annotation$cluster %in% c("clust_24")] <- "vil.SCT_1"
annotation$annotation[annotation$cluster %in% c("clust_29")] <- "vil.SCT_2"
annotation$notes[annotation$cluster %in% c("clust_25")] <- "proliferating"
annotation[annotation$cluster %in% paste0("clust_", c("16", "17", "23", "25", "24", "29")), ]
Marker gene plots
top50marker.plots[["clust_16"]]

top50marker.plots[["clust_17"]]

top50marker.plots[["clust_25"]]

top50marker.plots[["clust_23"]]

top50marker.plots[["clust_24"]]

top50marker.plots[["clust_29"]]

GO enrichment
enrichGO2("clust_16")
enrichGO2("clust_17")
2.5% of input gene IDs are fail to map...
enrichGO2("clust_25")
3% of input gene IDs are fail to map...
enrichGO2("clust_23")
enrichGO2("clust_24")
1% of input gene IDs are fail to map...
enrichGO2("clust_29")
2% of input gene IDs are fail to map...
Vento markers for VCT and SCT
DotPlot2(features = markers.vento$VCT_1, title = "vento markers: VCT_1")

DotPlot2(features = markers.vento$VCT_2, title = "vento markers: VCT_2")

DotPlot2(features = markers.vento$SCT, title = "vento markers: SCT")

Surya markers for VCT and SCT
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.VCT"], title = "surya markers: vil.VCT")

DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.SCT"], title = "surya markers: vil.SCT")

Pavli marker genes for VCT and SCT
DotPlot2(features = c("ADRB1", "PUF60", "SNORDA3A", "PRMT7", "E1F1AY",
"FXYD3", "GRAMD2", "INSL4", "ITGB8", "PAGE4",
"SLC13A4", "SLC22A11"),
title = "pavli markers: VCT_1")

DotPlot2(features = c("XAGE3", "XAGE2", "SERINC5", "S100P", "RASA1",
"MFAP5", "LIN28B", "KRT8", "KRT7", "INS-IGF2", "GCM1",
"GATA3", "ERVW-1", "ERVFRD-1", "EGR1", "EFNA1",
"CYP19A1", "PHLDA2", "ACKR2"),
title = "pavli markers: VCT_2")

DotPlot2(features = c("SEMA3B", "CSH2", "CSHL1", "FCGR2A", "GH2", "HBA1",
"HBA2", "HBB", "HBG1", "HBG2", "HLA-DMA", "HLA-DPA1",
"HLA-DQB1", "HLA-DRA", "HLA-DRB1", "HPGDS", "CGB8",
"LGALS14", "LYVE1", "NPIPB3", "CGB5", "PSG1", "PSG3",
"PSG6", "PSG9", "ZNF117", "ZNF91", "HIST2H2AB",
"RAMP2", "MUC20"),
title = "pavli markers: SCT")

Liu et al markers for trophoblasts
DotPlot2(seur, features = c("GCM1", "CSH1", "KRT7", "TFAP2C", "GATA3", "CDH1", "EGFR", "HLA-G", "MMP2", "RRM2", "CCNB1", "CDK1", "ERVFRD-1", "SLC1A5", "PAGE4", "CGB"), title = "Genes from Liu et al 2018")
The following requested variables were not found: CGB

Clusters 02, 06, 07, 15, 19, 33
Notes
- Most of these clusters represent some kind of fibroblast or a closely related cell type. Most clusters are enriched for GO terms related to ECM regulation.
- Clusters 02 and 15 marker genes are also highly expressed in clusters 00 and 18. Cluster 15 has more than 75% cells coming from decidua samples, and cluster 02 has over 50% cells coming from decidua samples. Decidual stromal cells are known to be actually a few different populations. Even in Vento-tormo et al, they are labeled as dS1–3, of which dS3 are the canonical DSC with PRL and IGFBP1, while dS1 and dS2 are closely related to DSC and are derived from endometrial stromal fibroblasts. These things together suggest that clusters 02 and 15 are the non-PRL varieties of DSC. To distinguish them from DSC, we can call them decidual fibroblasts.
- Marker genes of 06, 07, and 33 show highly correlated gene expressin, i.e. the markers on each of these clusters are also highly expressed in the other two clusters, suggesting that these three clusters a closely related cell populations. All three of these clusters are largely (over 75% of the cells) derived from villi samples, and they are fibroblast-like cells. Thus, these are likely villous fibroblasts.
- Cluster 19 is most similar to PV (perivascular) clusters from vento and SMC (smooth muscle cell) cluster from surya. Indeed, its markers include smooth muscle genes like MGP, MYH11, MYL9 etc, and enriched GO terms for this cluster include “muscle contraction” and “artery morphogenesis”. Thus, this cluster is most likely of perivascular smooth muscle cells of decidual origin (over 75% cells from decidua samples).
annotation$annotation[annotation$cluster %in% c("clust_02")] <- "dec.FB_1"
annotation$annotation[annotation$cluster %in% c("clust_15")] <- "dec.FB_2"
annotation$annotation[annotation$cluster %in% c("clust_06")] <- "vil.FB_1"
annotation$annotation[annotation$cluster %in% c("clust_07")] <- "vil.FB_2"
annotation$annotation[annotation$cluster %in% c("clust_33")] <- "vil.FB_3"
annotation$annotation[annotation$cluster %in% c("clust_19")] <- "dec.SMC"
annotation[annotation$cluster %in% paste0("clust_", c("02", "15", "06", "07", "33", "19")), ]
Marker gene plots
top50marker.plots[["clust_02"]]

top50marker.plots[["clust_15"]]

top50marker.plots[["clust_06"]]

top50marker.plots[["clust_07"]]

top50marker.plots[["clust_19"]]

top50marker.plots[["clust_33"]]

GO enrichment
enrichGO2("clust_02")
10.77% of input gene IDs are fail to map...
enrichGO2("clust_15")
2.56% of input gene IDs are fail to map...
enrichGO2("clust_06")
1% of input gene IDs are fail to map...
enrichGO2("clust_07")
3% of input gene IDs are fail to map...
enrichGO2("clust_19")
2% of input gene IDs are fail to map...
enrichGO2("clust_33")
3% of input gene IDs are fail to map...
Vento markers
DotPlot2(features = markers.vento$PV1, title = "vento markers: PV1")

DotPlot2(features = markers.vento$PV2, title = "vento markers: PV2")

Surya markers
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.FB1"], title = "surya markers: vil.FB1")

DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.SMC"], title = "surya markers: dec.SMC")

Cluster 20
Notes
This is most likely a cluster of B cells. Most markers genes are related to B cells: many immunoglobulin gnees, SPIB, MS4A1, CD79A, CD79B, BANK1, etc. It also expresses many genes that are markers for Plasma cluster from vento dataset. GO enrichment also agrees. Even though about half the cells from the cluster come from decidua and villi each, the likely origin of B cells is maternal rather than fetal.
annotation$annotation[annotation$cluster %in% c("clust_20")] <- "dec.Bcells"
annotation[annotation$cluster %in% paste0("clust_", c("20")), ]
Marker genes plot
top50marker.plots[["clust_20"]]

GO enrichment
enrichGO2("clust_20")
1% of input gene IDs are fail to map...
Vento markers
DotPlot2(features = markers.vento$Plasma, title = "vento markers: Plasma")

---
title: "Annotation of clusters in 10x data"
author: "Arun Chavan"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    number_sections: true
bibliography: ../refs.bib
---
Started: 2020-09-02  
Last edited: `r format(Sys.time())`

```{r message=FALSE, warning=FALSE}
### packages ==================================================================
library(tidyverse)

# single cell
library(Seurat)

# rmd
library(kableExtra)

# plotting
library(patchwork)
library(ggthemes)

# go enrichment
library(clusterProfiler)
library(org.Hs.eg.db)
```

We can take a first pass at annotating the clusters from our 10x data by comparing them to the reference dataset that we put together. This was done by compiling celltype-averaged expression values from the following single cell RNA-seq studies: [@pavlicev_single-cell_2017; @vento-tormo_single-cell_2018; @suryawanshi_single-cell_2018]. 

# Data

## Samples
This is the sample information sent by Alice. Not sure about the INP id for `AL09` and `AL10`. The library for `INP188` villi (associated with `AL07`) was problematic, so is not used.  

```{r}
sample.info <- read.delim("../info/from-alice/sample_info.tsv", stringsAsFactors = FALSE)
sample.info %>% kable(caption = "Sample information") %>% kable_styling(full_width = FALSE)
```

## Seurat-processed data
Read the `Seurat`-processed (by Eric) data sent by Alice.  

```{r}
seur <- readRDS("../info/from-alice/placenta.rds")

head(seur@meta.data)
```

The `orig.ident` column in the metadata represents the sample IDs from the sample info table above. We just have rename them to left-pad the numbers for consistency, and set new cluster names as default idents.

```{r}
# left pad orig.idents
seur@meta.data$orig.ident <- gsub("(AL)(\\d)$", "\\10\\2", seur@meta.data$orig.ident)

# add additional metadata (covid status, tissue)
seur@meta.data$covid <- sample.info$covid[match(x = seur@meta.data$orig.ident, 
                                                table = sample.info$sample_id)]
seur@meta.data$tissue <- sample.info$tissue[match(x = seur@meta.data$orig.ident,
                                                  table = sample.info$sample_id)]

# rename clusters for consistency
seur$seurat_clusters <- paste0("clust_", seur$seurat_clusters)
seur$seurat_clusters <- gsub("(clust_)(\\d$)", "\\10\\2", as.character(seur$seurat_clusters)) # left pad
seur$seurat_clusters <- factor(seur$seurat_clusters, levels = paste0("clust_", c("00", "01", "02", "03", "04", "05", "06", "07", "08", "09", 10:34)))

# set seurat clusters as default idents
Idents(seur) <- seur@meta.data$seurat_clusters
```

## Average by cluster
We have to get transcriptomes averaged by clusters, so that we can compare them with the reference data in order to narrow down annotations. For averaging, we will use `sctransform` normalized data since this is what we use for marker gene detection down the line.

```{r message=FALSE}
# average sctransformed expression
seur.avg <- AverageExpression(seur)

# averaged data to use for correlations (SCT transformed)
plac <- seur.avg$SCT
```

## Reference data
```{r}
# read reference datasets
ref <- read.csv("../results/01_reference-atlas/vento-surya-pavli_joined.csv", stringsAsFactors = FALSE)
```

## Get data in shape
We will join the data with the reference data so they are both in the same dataframe. 

```{r}
# add gene names column
plac$gene_name <- rownames(plac)

# inner join
plac <- dplyr::inner_join(plac, ref, by = "gene_name")

head(plac)
```

# Correlations within the data
Before moving to annotating clusters by comparison with the reference dataset, first we will look at the correlation structure within the dataset.

## Correlation matrix
```{r fig.asp=0.7}
# build correlation matrix from expression data
cor.matrix <- cor(plac %>% dplyr::select(starts_with("clust")), method = 'spearman')

# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix)) 
hc <- hclust(dd, method = 'complete')
cormat <- cor.matrix[hc$order, hc$order]

# melt correlation matrix
dat <- reshape2::melt(cormat, na.rm = T)

## plot
p <- ggplot(data = dat, aes(Var1, Var2, fill = value)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = 'white', high = 'red',
                      name = "Spearman\nCorrelation") +
  coord_fixed(ratio = 1) +
  labs(title = "Correlations among clusters") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 8, 
                                   hjust=1, vjust = 0.5),
        axis.text.y = element_text(size=8),
        axis.ticks.length = unit(0.15, units = c('lines')),
        legend.title = element_text(size = 10),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank(),
        legend.key.size = unit(0.8, units = c('lines')))

ggsave(p, filename = "../results/02_annotation/plots/corrplot_clusters.pdf", 
       device = "pdf", width = 7, height = 6, units = "in")

print(p)
```

## Heirarchical clustering
```{r}
# build correlation matrix from expression data
cor.matrix <- cor(plac %>% dplyr::select(starts_with("clust")), method = 'spearman')

# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix)) 
hc <- hclust(dd, method = 'complete')

pdf(file = "../results/02_annotation/plots/hclust_clusters.pdf", width = 7, height = 5)
plot(hc, cex = 0.8)
dev.off()
plot(hc, cex = 0.8)
```

There are three big cluster blocks, with substructure within them. The cluster blocks also correspond well with the UMAP plot that was sent by Alice. 

# Annotation by correlation
Let's now actually measure correlations between all clusters with the reference datasets. We will treat our data as query dataset and for each cluster assign top 3 cell types in each reference dataset.

```{r}
refdata <- c("vento", "surya", "pavli")

# build correlation matrix from expression data
cor.matrix <- cor(plac[, names(plac)[names(plac) != "gene_name"]], 
                  method = 'spearman')

# reorder correlation matrix based on clustering
dd <- as.dist((1 - cor.matrix))
hc <- hclust(dd, method = 'complete')
cor.matrix <- cor.matrix[hc$order, hc$order]

# melt correlation matrix
cormat <- reshape2::melt(cor.matrix, na.rm = T)
cormat$Var1_source <- sapply(strsplit(as.character(cormat$Var1), split = "_"), "[[", 1)
cormat$Var2_source <- sapply(strsplit(as.character(cormat$Var2), split = "_"), "[[", 1)

# subset
cormat <- cormat[which(cormat$Var1_source %in% refdata & 
                   cormat$Var2_source == "clust"), ]

# top 3 matches with highest correlation coefficients
topmatch <- data.frame( # empty df to fill top matches from the loop below
  cluster = unique(as.character(cormat$Var2)) %>% sort(),
  vento.top1 = NA,
  surya.top1 = NA,
  pavli.top1 = NA,
  vento.top2 = NA,
  surya.top2 = NA,
  pavli.top2 = NA,
  vento.top3 = NA,
  surya.top3 = NA,
  pavli.top3 = NA
)

cormat$top3 <- NA

for(i in unique(cormat$Var2)){
  for(j in refdata){
    # identify top3 match indices
    ind <- which(cormat$Var2 == i & cormat$Var1_source == j)
    val <- cormat$value[ind]
    top3ind <- ind[order(val, decreasing = TRUE)[1:3]]
    
    # assign match ranking to correlation data for plotting
    cormat$top3[top3ind[1]] <- "1"
    cormat$top3[top3ind[2]] <- "2"
    cormat$top3[top3ind[3]] <- "3"
    
    # assign top matches to topmatches dataframe
    topmatch[topmatch$cluster == i, paste0(j, ".top1")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[1]]))
    topmatch[topmatch$cluster == i, paste0(j, ".top2")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[2]]))
    topmatch[topmatch$cluster == i, paste0(j, ".top3")] <- gsub(paste0(j, "_"), "", as.character(cormat$Var1[top3ind[3]]))
  }
}

```

## Plots

```{r}
# plotting function
clustAnnoPlot <- function(dat, query, reference){
  
  dat <- dat[which(dat$Var2_source == query & dat$Var1_source == reference), ]
  
  p <- ggplot(data = dat, 
              aes(Var1, Var2, fill = value)) +
    geom_tile(colour = "white") +
    scale_fill_gradient(low = 'white', high = 'red',
                        name = "Spearman\nCorrelation") +
    geom_point(aes(Var1, Var2, alpha = top3),
               size = 1.5, shape = 19, stroke  = 0) +
    scale_alpha_manual(values = c(1, 0.5, 0.25), 
                       breaks = c(1, 2, 3),
                       name = "top3", na.value = 0) +
    coord_fixed(ratio = 1) +
    xlab("reference") +
    ylab("query") +
    labs(caption = "For each celltype in query, black points represent top 3 celltypes from reference with highest correlation.",
         title = paste0(query, " vs. ", reference)) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, size = 8, 
                                     hjust=1, vjust = 0.5),
          axis.text.y = element_text(size=8),
          axis.ticks.length = unit(0.15, units = c('lines')),
          legend.title = element_text(size = 10),
          axis.title = element_text(size = 8),
          plot.caption = element_text(size = 7),
          panel.grid = element_blank(),
          panel.border = element_blank(),
          legend.key.size = unit(0.8, units = c('lines')))
  
  return(p)
}

```

### Against Vento-Tormo et al
```{r fig.asp=0.9}
## Annotation plots
anno.vento <- clustAnnoPlot(dat = cormat, query = "clust", reference = "vento")
cowplot::ggsave2(anno.vento, device = "pdf", width = 7.5, height = 6.5, units = "in",
                 filename = "../results/02_annotation/plots/annotation-by-correlation_ref-vento.pdf")
print(anno.vento)
```

### Against Suryavanshi et al
```{r fig.asp=0.9}
## Annotation plots
anno.surya <- clustAnnoPlot(dat = cormat, query = "clust", reference = "surya")
cowplot::ggsave2(anno.surya, device = "pdf", width = 7.5, height = 6.5, units = "in",
                 filename = "../results/02_annotation/plots/annotation-by-correlation_ref-surya.pdf")
print(anno.surya)
```

### Against Pavlicev et al
```{r fig.asp=0.9}
## Annotation plots
anno.pavli <- clustAnnoPlot(dat = cormat, query = "clust", reference = "pavli")
cowplot::ggsave2(anno.pavli, device = "pdf", width = 7, height = 6, units = "in",
                 filename = "../results/02_annotation/plots/annotation-by-correlation_ref-pavli.pdf")
print(anno.pavli)
```

## Top matches
```{r}
topmatch %>% kable(caption = "Top 3 matches against all reference datasets.") %>% kable_styling(full_width = FALSE)

```

## Labels for clusters
We need to create consistent labels for all cell types that we identify. Below is a `list` I created in which the top level objects are labels we will give to the cell types, within which contained are lists of corresponding labels from `vento` and `surya`. These are tentative labels. After the first pass, when we go through the clusters with a fine-toothed comb, we can modify them further. For example, if we have multiple clusters labeled as `dec.Mac` (decidual macrophages), and if those clusters are distinct, we can then break up this label into `dec.Mac1` and `dec.Mac2`. 

```{r}
labs <- list("dec.DSC"    = list("vento.lab" = c("dS1", "dS2", "dS3"),
                                 "surya.lab" = c("dec.DSC", "dec.FB1", "dec.FB2")),
             "dec.DC"     = list("vento.lab" = c("DC1", "DC2"),
                                 "surya.lab" = c("dec.DC1", "dec.DC2")),
             "dec.Mac"    = list("vento.lab" = c("dM1", "dM2", "dM3"),
                                 "surya.lab" = c("dec.MAC")),
             "dec.NK"     = list("vento.lab" = c("dNK.p", "dNK1", "dNK2", "dNK3", "NK.CD16neg", "NK.CD16pos"),
                                 "surya.lab" = c("dec.NK1", "dec.NK2")),
             "dec.SMC"    = list("vento.lab" = c("dP1", "dP2"),
                                 "surya.lab" = c("dec.SMC")),
             "dec.Endo"   = list("vento.lab" = c("Endo.m"),
                                 "surya.lab" = c("dec.VEC")),
             "dec.Epi"    = list("vento.lab" = c("Epi1", "Epi2"),
                                 "surya.lab" = c("dec.EEC")),
             "dec.Tcell"  = list("vento.lab" = c("Tcells"),
                                 "surya.lab" = c("dec.TC")),
             "vil.Endo"   = list("vento.lab" = c("Endo.f"),
                                 "surya.lab" = c("vil.VEC")),
             "vil.EVT"    = list("vento.lab" = c("EVT"),
                                 "surya.lab" = c("vil.EVT")),
             "vil.FB"     = list("vento.lab" = c("fFB1", "fFB2"),
                                 "surya.lab" = c("vil.FB1", "vil.FB2", "vil.FB3")),
             "vil.Hofb"   = list("vento.lab" = c("HB"),
                                 "surya.lab" = c("vil.HC")),
             "vil.SCT"    = list("vento.lab" = c("SCT"),
                                 "surya.lab" = c("vil.SCT")),
             "vil.VCT"    = list("vento.lab" = c("VCT"),
                                 "surya.lab" = c("vil.VCT")),
             "unk.Gran"   = list("vento.lab" = c("Granulocytes"),
                                 "surya.lab" = c()),
             "unk.ILC"    = list("vento.lab" = c("ILC3"),
                                 "surya.lab" = c()),
             "unk.Mono"   = list("vento.lab" = c("MO"),
                                 "surya.lab" = c()),
             "unk.Plasma" = list("vento.lab" = c("Plasma"),
                                 "surya.lab" = c()),
             "unk.EB"     = list("vento.lab" = c(),
                                 "surya.lab" = c("vil.EB")),
             "unk.Endo.L" = list("vento.lab" = c("Endo.L"),
                                 "surya.lab" = c("dec.LEC"))
)

```

## Matches consistent between reference datasets
The following clusters have consistent top1 match between `vento` and `surya` references, i.e. they correspond to the same cell type category. The `pavli` reference is too unresolved to be used diagnostically, so we will ignore it for now. For some cell types it is confirmatory, though, e.g. cluster_00 corresponds to DSC in all three references. 

```{r}
# find out which clusters are consistent.
# If vento.top1 and surya.top1 are found in the same item in the labs list, the mapping is consistent.
consistent <- c(NA)
for(i in 1:nrow(topmatch)){
  consistent[i] <- grep(topmatch$vento.top1[i], labs) == grep(topmatch$surya.top1[i], labs)
}

# subset to include consistent set
topmatch.cons <- topmatch %>% 
  filter(consistent) %>% 
  dplyr::select(cluster, vento.top1, surya.top1)

# add our tentative labels to the clusters
for(i in 1:nrow(topmatch.cons)){
  topmatch.cons$label[i] <- names(labs)[grep(topmatch.cons$vento.top1[i], labs)]
}

# print
topmatch.cons[order(topmatch.cons$label), ] %>% knitr::kable() %>% kable_styling(full_width = FALSE)
```

## Ambiguous clusters
The top1 matches for some clusters with `vento` and `surya` are not the same. The are the clusters that will require a more detailed look to determine their identify. 

```{r}
# subset for ambiguous clusters
topmatch.ambig <- topmatch %>% 
  filter(!consistent) %>% 
  dplyr::select(cluster, vento.top1, surya.top1)

# add our labels.
for(i in 1:nrow(topmatch.ambig)){
  topmatch.ambig$label[i] <- paste0(
    names(labs)[grep(topmatch.ambig$vento.top1[i], labs)],
    " or ",
    names(labs)[grep(topmatch.ambig$surya.top1[i], labs)]
    )
}

# print
topmatch.ambig[order(topmatch.ambig$label), ] %>% kable() %>% kable_styling(full_width = FALSE)
```

# Annotation refinement
The annotations we have so far are only a starting point. Now we can look at each annotation more carefully to make sure that everything makes sense.

## Sample contribution to clusters
One of the ways in which we can refine the clusters further is by knowing which tissue that sample arises from. For example, if a cluster has macrophage gene signature and if most cells in that cluster come from villi samples, we can further narrow down the identity of the cluster to Hofbauer cells. We can do this by simply counting for each cluster how many cells come from decidua vs villi and by calculating the percentage. 

```{r fig.asp=1.2, fig.width=6.5}
# count cells in each cluster by tissue of origin
bytissue <- table(seur$tissue, seur$seurat_clusters) %>% 
  as.data.frame() %>% 
  dplyr::rename(tissue = Var1, cluster = Var2, frequency = Freq)

# calculate fraction
for(i in 1:nrow(bytissue)){
  bytissue$fraction[i] <- bytissue$frequency[i]/
    sum(bytissue$frequency[bytissue$cluster == bytissue$cluster[i]])
}

# plot
p.cells <- ggplot(data = bytissue, aes(x = frequency, y = cluster)) +
  geom_bar(aes(fill = tissue), stat = "identity", position = "stack") +
  ggtitle("Number of cells")

p.frac <- ggplot(data = bytissue, aes(x = fraction, y = cluster)) +
  geom_bar(aes(fill = tissue), stat = "identity", position = "stack") +
  ggtitle("Fraction of cells")

p.cells + p.frac + plot_layout(guides = "collect", nrow = 2) +
  plot_annotation(caption = "Sample contribution to clusters") &
  coord_flip() &
  scale_fill_tableau(palette = "Classic 10 Medium") &
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(size = 0.25),
        plot.title = element_text(size = 10),
        axis.text.x = element_text(angle = 90, vjust = 0.5))
  

```
Following clusters have more than 75% cells coming from the same tissue of origin. 

```{r}
bytissue[, c("cluster", "tissue", "fraction")] %>% 
  pivot_wider(names_from = "tissue", values_from = "fraction") %>% 
  filter(decidua > 0.75 | villi > 0.75) %>% 
  kable(digits = 2, row.names = FALSE, 
        caption = "Clusters with more than 75% cells from one tissue") %>%
  kable_styling(full_width = FALSE)
```

Following clusters have ambiguous origin, i.e. they contain cells from decidua and villi samples. 

```{r}
bytissue[, c("cluster", "tissue", "fraction")] %>% 
  pivot_wider(names_from = "tissue", values_from = "fraction") %>% 
  filter(decidua < 0.75 & villi < 0.75) %>% 
  kable(digits = 2, row.names = FALSE, 
        caption = "Clusters with cells from both tissues") %>%
  kable_styling(full_width = FALSE)
```

This doesn't necessarily clarify too many things. There are many clusters that have over 3/4th of the cells coming from the same tissue, but often not the tissue you expect. For example clusters 16 and 17 are either villous or syncytial trophoblast cells but they are coming mostly from decidua samples. You expect that for cluster 13, which is extravillous trophoblast but not for the other types. At least in the species that I have worked with, it is often difficult to separate the fetal tissue from maternal tissue. So this mixed result could be a result of that difficulty. It looks like we can use this a rough guide, but we can't necessarily rely of this analysis, i.e. the marker gene should take precedence over tissue of origin in cell type id because the latter is experimentally (and as a consequence, analytically) messy to disentangle.  

## Marker genes
### Identify marker genes 
I ran `Seurat::findAllMarkers` on this data using the `SCT` assay. This function outputs a list of marker genes for each cluster compared to all other cells in the data. This takes a while (hours) to run, so I ran it on the cluster and saved the results to a `.csv`. See the `code` directory for full code, and see below for the function call. 

```{r echo=TRUE, eval=FALSE}
# set default assay to SCT
DefaultAssay(object = plac) <- "SCT"

# find markers for all clusters
markers <- FindAllMarkers(object = plac, 
                          only.pos = TRUE,
                          logfc.threshold = 0.25)  

# write.csv
write.csv(markers, "/home/arc78/scratch60/covid-placenta/markers_sct.csv", row.names = FALSE)
```

```{r}
markers <- read.csv("../results/02_annotation/files/markers_sct.csv")

# rename clusters
markers$cluster <- paste0("clust_", markers$cluster)
markers$cluster <- gsub("(clust_)(\\d)$", "\\10\\2", markers$cluster)
markers$cluster <- factor(markers$cluster, levels = unique(markers$cluster))

# split by cluster
markers <- split.data.frame(markers, markers$cluster)
```

### Plot marker genes
We will plot top 50 marker genes for all clusters and save the plot to pdf. 
```{r}
# plotting function
DotPlot2 <- function(object = "seur", assay = "SCT", features, title, ...) {
  p <- DotPlot(seur, 
               assay = "SCT",
               features = features, 
               dot.min = 0.05, dot.scale = 4) +
    coord_flip() +
    labs(caption = paste0(assay, " normalized expression"), title = title) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          panel.grid = element_blank())
  
  return(p)
}
```

```{r fig.asp=1}
top50marker.plots <- list()
for(i in names(markers)) {
  if(nrow(markers[[i]]) > 50)
    features <- markers[[i]]$gene[1:50]
  if(nrow(markers[[i]]) < 50)
    features <- markers[[i]]$gene

  top50marker.plots[[i]] <- DotPlot2(features = features, title = i)
         
  cowplot::ggsave2(top50marker.plots[[i]], 
                   filename = paste0("../results/02_annotation/plots/top50markers_dotplot_", i, ".pdf"),
                   width = 7.5, height = 7.5, units = "in")
}
```

# Manual annotation
With the marker gene plots we have made, we can examine the top marker genes in each cluster to manually confirm or adjust the annotation of each cluster.

Let's make a dataframe in which to keep track of the assignments.

```{r}
annotation <- data.frame(
  cluster = unique(seur$seurat_clusters) %>% sort(),
  annotation = NA,
  notes = NA
)
```

For each set of marker genes, we can also perform a number of manual checks to assess cell type annotation.  

First is GO enrichment using `clusterProfile` package, for which below is a function for quickly looking at top 10 enriched GO categories.
```{r}
# function for GO enrichment
# 1. for a given cluster, prints n.print.rows top enriched GO terms
# 2. Can exclude ribosomal genes (default) because they dominate some GO enrichments
enrichGO2 <- function(clust, ngenes = 100, include.ribo = FALSE, n.print.rows = 10, ...) {
  
  ids.bitr <- bitr(markers[[clust]]$gene[1:ngenes], 
                fromType = "SYMBOL", toType = c("ENTREZID", "SYMBOL"), 
                OrgDb = "org.Hs.eg.db")
    
  if(include.ribo == FALSE)
    features <- ids.bitr$ENTREZID[!(grepl("RP[SL]", ids.bitr$SYMBOL))]
  if(include.ribo == TRUE)
    features <- ids.bitr$ENTREZID

  ego <- enrichGO(
    gene          = features,    
    OrgDb         = org.Hs.eg.db,
    ont           = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.01, # default 0.01
    qvalueCutoff  = 0.05, # default 0.05
    readable = TRUE)
  
  dftoprint <- clusterProfiler::simplify(ego)@result[1:n.print.rows, c(2,8)]
  rownames(dftoprint) <- NULL
  
  return(dftoprint)
}
```

We can also check if the cluster is enriched is genes that are labeled as markers genes for a cell type by other studies. [@vento-tormo_single-cell_2018] have provided top 30 marker genes for all cell types identified by them.

```{r message=FALSE}
# read top 30 markers from vento-tormo et al
markers.vento <- readxl::read_excel("../info/from-papers/vento-tormo_2018/41586_2018_698_MOESM1_ESM/Supplementary Table 2.xlsx")

# rename columns
markers.vento <- dplyr::rename(markers.vento,
                               VCT_1 = VCT...2,
                               Tcells_1 = `T cells...5`,
                               Tcells_2 = `T cells...9`,
                               VCT_2 = `VCT...10`,
                               Tcells_3 = `T cells...11`,
                               FB_1 = F1,
                               EVT_1 = `EVT...14`,
                               Tcells_4 = `T cells...15`,
                               EVT_2 = `EVT...16`,
                               NKCD16pos = `Blood NK CD16+`,
                               MO_1 = `MO...22`,
                               MO_2 = `MO...27`,
                               NKCD16neg = `Blood NK CD16-`,
                               FB2 = F2,
                               Endo.m = `Endo (m)`,
                               Endo.L = `Endo L`,
                               Endo.f = `Endo (f)` 
)

# remove ensembl ids
markers.vento <- apply(X = markers.vento, MARGIN = 2, FUN = function(x) gsub("_ENSG\\d+", "", x)) %>% 
  as.data.frame()
```

[@suryawanshi_single-cell_2018] have also provided a list of top30 marker genes in the supplementary data. 
```{r}
markers.surya.vil <- readxl::read_excel("../info/from-papers/suryawanshi_2018/supp-data/aau4788_Data_file_S3.xlsx", sheet = "Villi", range = "A1:G271", trim_ws = TRUE)
markers.surya.dec <- readxl::read_excel("../info/from-papers/suryawanshi_2018/supp-data/aau4788_Data_file_S3.xlsx", sheet = "Deciuda", range = "A1:G331", trim_ws = TRUE)

markers.surya.vil <- dplyr::rename(markers.surya.vil, cluster = `cell type`)

markers.surya.vil$cluster <- paste0("vil.", markers.surya.vil$cluster)
markers.surya.dec$`cluster` <- paste0("dec.", markers.surya.dec$`cluster`)

markers.surya <- rbind(markers.surya.vil, markers.surya.dec)
```


## Clusters 0 and 18
### Notes
1. Clusters 0 and 18 highly correlated with each other and most similar to each other than to other clusters. 
2. Both are consistently similar to the decidual stromal cells in both vento and surya datasets. 
3. The marker genes include Prolactin, IGFBP1 etc, which are typical of decidual stromal cells. 
4. These two clusters also arise unabiguously from the decidua samples rather than villi samples.

```{r}
annotation$annotation[annotation$cluster %in% c("clust_00", "clust_18")] <- "dec.DSC"

annotation[annotation$cluster %in% paste0("clust_", c("00", "18")), ]
```

### Marker genes plots
```{r fig.asp=1}
top50marker.plots[["clust_00"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_18"]]
```

### Go anrichment
```{r}
enrichGO2("clust_00")
```

### Vento markers for DSC
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$dS3, title = "vento: dS3")
```

### Surya markers for DSC
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.DSC"], title = "surya markers: DSC")
```

## Clusters 4 and 21
### Notes

1. Clusters 4 and 21 are highly correlated. 
2. They consistently get assigned as decidual endothelial cells by both vento and surya datasets.
3. More than 75% of the cells in both of these clusters come from the decidua samples.
4. Markers genes of both clusters are enriched in GO categories related to epithelium development, endothelium development etc. 
5. But these two clusters are not as alike as clusters 0 and 18. They are different in terms of their expression of many endothelium related genes like CD34, PROX1, VWF, SPARCL. Cluster 21 looks more like "typical" endothelial cells. Cluster 4 has low expression of many typical endothelial genes. 
6. Cluster 4 has higher expression of some lymphatic endothelial cells: PROX1, LYVE1. Many genes from the list of surya marker genes for dec.LEC are expressed in cluster 4. This suggests that Cluster 4 may be lymphatic endothelial cells and cluster 21 may be endothelial cells. 

```{r}
annotation$annotation[annotation$cluster %in% c("clust_04")] <- "dec.Endo.L"
annotation$annotation[annotation$cluster %in% c("clust_21")] <- "dec.Endo"

annotation[annotation$cluster %in% paste0("clust_", c("04", "21")), ]
```

### Marker gene plots
```{r, fig.asp=1}
top50marker.plots[["clust_04"]]
```

```{r, fig.asp=1}
top50marker.plots[["clust_21"]]
```

### GO enrichment
```{r}
enrichGO2("clust_04")
```

```{r}
enrichGO2("clust_21")
```

### Vento markers for endothelial cells
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$Endo.m, title = "vento: Endo.m")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$Endo.f, title = "vento: Endo.f")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$Endo.L, title = "vento: Endo.L")
```

### Surya markers for endothelial cells
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.VEC"], title = "surya markers: vil.VEC")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.VEC"], title = "surya markers: dec.VEC")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.LEC"], title = "surya markers: dec.LEC")
```

## Cluster 13
### Notes

1. This cluster is unique in its expression of HLA-G, typical of extravillous trophoblast. 
2. Several other genes often enriched in EVT are also enriched in this cluster: FN1, PAPPA2, DIO2, etc.
3. top 30 markers associated with EVT_2 from vento dataset are also highly and specifically expressed in this cluster. 
4. It is consistently assigned to be EVT based on vento, surya, and pavli datasets. 

Annotation:  
**Cluster 13: Extravillous trophoblast (vil.EVT)**

```{r}
annotation$annotation[annotation$cluster %in% c("clust_13")] <- "vil.EVT"

annotation[annotation$cluster %in% c("clust_13"), ]
```

### Marker genes plot
```{r fig.asp=1}
top50marker.plots[["clust_13"]]
```

### GO enrichment
```{r message=FALSE}
enrichGO2("clust_13")
```

### Vento markers for EVT
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$EVT_1, title = "vento markers: EVT_1")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$EVT_2, title = "vento markers: EVT_2")
```

### Surya marker genes for EVT
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.EVT"], title = "surya markers: vil.EVT")
```

## Other trophoblast clusters (16, 17, 23, 24, 25, 29)
### Notes
For more information on expression patterns and markers of different trophoblast types, see [@liu_single-cell_2018].

1. In addition to cluster 13 (EVT), there are 6 other clusters that are likely trophoblast cell types (they express KRT7 and are related clusters). These 6 clusters are organized into two correlated blocks (see "correlations section above"): (16, 17, 25) and (23, 29, 24) of which the former are likely VCT and the latter SCT according to correlation-based annotation. 
2. Cluster 29 appears to be SCT given its expression of CGA, GDF15, ERVFRD-1 (Syncytin 2) genes. For some reason GO enrichment didn't work for markers of cluster 29, but it does consistently express SCT markers from vento and surya.
3. Cluster 24 doesn't express ERVFRD-1 much, but it does express most other telltale genes of SCT: CSH1, CSH2, CGA, CSHL1, many PSG genes, GH2, PGF, GDF15, etc. And it is not enriched for expression of VCT markers from vento and surya. It's enriched GO terms have to do with hormone production and signaling as expected for SCT. Thus, this is most likely an SCT cluster. 
4. Clusters 16 and 17 don't express many markers of SCT, but do express many vento and surya markers of VCT: PAGE4, PEG10, XAGE3, etc.
5. Cluster 25 has VCT markers as well as a strong signature of cell division genes. This shows up in the GO enrichment as well, where most enriched GO terms are "nuclear division", "chromosome segregation", "cell cycle G1/S phase transition" etc. This is likely a cluster of proliferating VCT cells. Incidentally both [@vento-tormo_single-cell_2018] and [@liu_single-cell_2018] have a separate population of VCT with a cell cycle gene signature (VCT_2 in vento). 
6. Cluster 23 is most similar to SCT of vento but VCT of surya. Indeed this cluster has expression of marker genes for both VCT and SCT (see plots). It expresses PAGE4, PEG10 etc, but also expresses syncytin gene ERVW-1. This could represent a group of cytotrophoblast cells that are in the process of differentiating to sycnytial trophoblast. We will label it as a VCT type. 

```{r}
annotation$annotation[annotation$cluster %in% c("clust_16")] <- "vil.VCT_1"
annotation$annotation[annotation$cluster %in% c("clust_17")] <- "vil.VCT_2"
annotation$annotation[annotation$cluster %in% c("clust_23")] <- "vil.VCT_3"
annotation$annotation[annotation$cluster %in% c("clust_25")] <- "vil.VCT_4"
annotation$annotation[annotation$cluster %in% c("clust_24")] <- "vil.SCT_1"
annotation$annotation[annotation$cluster %in% c("clust_29")] <- "vil.SCT_2"

annotation$notes[annotation$cluster %in% c("clust_25")] <- "proliferating"

annotation[annotation$cluster %in% paste0("clust_", c("16", "17", "23", "25", "24", "29")), ]
```

### Marker gene plots
```{r fig.asp=1}
top50marker.plots[["clust_16"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_17"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_25"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_23"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_24"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_29"]]
```

### GO enrichment
```{r message=FALSE}
enrichGO2("clust_16")
```

```{r message=FALSE}
enrichGO2("clust_17")
```

```{r message=FALSE}
enrichGO2("clust_25")
```

```{r message=FALSE}
enrichGO2("clust_23")
```

```{r message=FALSE}
enrichGO2("clust_24")
```

```{r message=FALSE}
enrichGO2("clust_29")
```

### Vento markers for VCT and SCT
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$VCT_1, title = "vento markers: VCT_1")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$VCT_2, title = "vento markers: VCT_2")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$SCT, title = "vento markers: SCT")
```

### Surya markers for VCT and SCT
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.VCT"], title = "surya markers: vil.VCT")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.SCT"], title = "surya markers: vil.SCT")
```

### Pavli marker genes for VCT and SCT
```{r fig.asp=0.4, warning=FALSE}
DotPlot2(features = c("ADRB1", "PUF60", "SNORDA3A", "PRMT7", "E1F1AY", 
                      "FXYD3", "GRAMD2", "INSL4", "ITGB8", "PAGE4",
                      "SLC13A4", "SLC22A11"), 
         title = "pavli markers: VCT_1")
```


```{r fig.asp=0.6, warning=FALSE}
DotPlot2(features = c("XAGE3", "XAGE2", "SERINC5", "S100P", "RASA1",
                      "MFAP5", "LIN28B", "KRT8", "KRT7", "INS-IGF2", "GCM1",
                      "GATA3", "ERVW-1", "ERVFRD-1", "EGR1", "EFNA1", 
                      "CYP19A1", "PHLDA2", "ACKR2"), 
         title = "pavli markers: VCT_2")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = c("SEMA3B", "CSH2", "CSHL1", "FCGR2A", "GH2", "HBA1", 
                      "HBA2", "HBB", "HBG1", "HBG2", "HLA-DMA", "HLA-DPA1",
                      "HLA-DQB1", "HLA-DRA", "HLA-DRB1", "HPGDS", "CGB8",
                      "LGALS14", "LYVE1", "NPIPB3", "CGB5", "PSG1", "PSG3", 
                      "PSG6", "PSG9", "ZNF117", "ZNF91", "HIST2H2AB",
                      "RAMP2", "MUC20"), 
         title = "pavli markers: SCT")
```

### Liu et al markers for trophoblasts
```{r fig.asp=0.5}
DotPlot2(seur, features = c("GCM1", "CSH1", "KRT7", "TFAP2C", "GATA3", "CDH1", "EGFR", "HLA-G", "MMP2", "RRM2", "CCNB1", "CDK1", "ERVFRD-1", "SLC1A5", "PAGE4", "CGB"), title = "Genes from Liu et al 2018")
```
## Clusters 02, 06, 07, 15, 19, 33
### Notes

1. Most of these clusters represent some kind of fibroblast or a closely related cell type. Most clusters are enriched for GO terms related to ECM regulation. 
2. Clusters 02 and 15 marker genes are also highly expressed in clusters 00 and 18. Cluster 15 has more than 75% cells coming from decidua samples, and cluster 02 has over 50% cells coming from decidua samples. Decidual stromal cells are known to be actually a few different populations. Even in Vento-tormo et al, they are labeled as dS1--3, of which dS3 are the canonical DSC with PRL and IGFBP1, while dS1 and dS2 are closely related to DSC and are derived from endometrial stromal fibroblasts. These things together suggest that clusters 02 and 15 are the non-PRL varieties of DSC. To distinguish them from DSC, we can call them decidual fibroblasts. 
3. Marker genes of 06, 07, and 33 show highly correlated gene expressin, i.e. the markers on each of these clusters are also highly expressed in the other two clusters, suggesting that these three clusters a closely related cell populations. All three of these clusters are largely (over 75% of the cells) derived from villi samples, and they are fibroblast-like cells. Thus, these are likely villous fibroblasts.
4. Cluster 19 is most similar to PV (perivascular) clusters from vento and SMC (smooth muscle cell) cluster from surya. Indeed, its markers include smooth muscle genes like MGP, MYH11, MYL9 etc, and enriched GO terms for this cluster include "muscle contraction" and "artery morphogenesis". Thus, this cluster is most likely of perivascular smooth muscle cells of decidual origin (over 75% cells from decidua samples). 

```{r}
annotation$annotation[annotation$cluster %in% c("clust_02")] <- "dec.FB_1"
annotation$annotation[annotation$cluster %in% c("clust_15")] <- "dec.FB_2"
annotation$annotation[annotation$cluster %in% c("clust_06")] <- "vil.FB_1"
annotation$annotation[annotation$cluster %in% c("clust_07")] <- "vil.FB_2"
annotation$annotation[annotation$cluster %in% c("clust_33")] <- "vil.FB_3"
annotation$annotation[annotation$cluster %in% c("clust_19")] <- "dec.SMC"

annotation[annotation$cluster %in% paste0("clust_", c("02", "15", "06", "07", "33", "19")), ]
```

### Marker gene plots
```{r fig.asp=1}
top50marker.plots[["clust_02"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_15"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_06"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_07"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_19"]]
```

```{r fig.asp=1}
top50marker.plots[["clust_33"]]
```

### GO enrichment
```{r message=FALSE}
enrichGO2("clust_02")
```

```{r message=FALSE}
enrichGO2("clust_15")
```

```{r message=FALSE}
enrichGO2("clust_06")
```

```{r message=FALSE}
enrichGO2("clust_07")
```

```{r message=FALSE}
enrichGO2("clust_19")
```

```{r message=FALSE}
enrichGO2("clust_33")
```

### Vento markers

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$PV1, title = "vento markers: PV1")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$PV2, title = "vento markers: PV2")
```

### Surya markers
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "vil.FB1"], title = "surya markers: vil.FB1")
```

```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.surya$gene[markers.surya$cluster == "dec.SMC"], title = "surya markers: dec.SMC")
```

## Cluster 20
### Notes
This is most likely a cluster of B cells. Most markers genes are related to B cells: many immunoglobulin gnees, SPIB, MS4A1, CD79A, CD79B, BANK1, etc. It also expresses many genes that are markers for Plasma cluster from vento dataset. GO enrichment also agrees. Even though about half the cells from the cluster come from decidua and villi each, the likely origin of B cells is maternal rather than fetal. 

```{r}
annotation$annotation[annotation$cluster %in% c("clust_20")] <- "dec.Bcells"

annotation[annotation$cluster %in% paste0("clust_", c("20")), ]
```

### Marker genes plot
```{r fig.asp=1}
top50marker.plots[["clust_20"]]
```

### GO enrichment
```{r message=FALSE}
enrichGO2("clust_20")
```

### Vento markers
```{r fig.asp=0.8, warning=FALSE}
DotPlot2(features = markers.vento$Plasma, title = "vento markers: Plasma")
```


# Session Info
```{r}
sessionInfo()
```

# References

